M.Shumskiy
Date
The types of venues expected in each region appear to be related to certain parameters.
In this project, we analyze the venues of each region (parish), cluster them, and explore the relation between the geographical distribution of the clusters and the region data.
The objective is to analyze city dynamics based on venue data and, most importantly, to find which region parameter most influences the type of venues found there.
In this project I clustered the regions based on their venue data.
An interesting phenomenon, cluster propagation, was also observed.
This project was an assignment of the IBM Data Science Professional Certificate. Its purpose was to use the Foursquare API to segment and cluster regions based on the most common venues of each region. I decided to apply this method to three cities of Portugal: two more rural cities and one urban city (Porto).
I am familiar with all three cities, so it seemed interesting to study their dynamics and compare them.
Cities have an intrinsic dynamic in their venue distribution, which may vary from region to region. The more central regions are expected to differ from the more peripheral ones. But what affects this distribution most? Is it unemployment, average age, population density, or plain distance to the city center? By comparing all of these parameters we can find some interesting results that give insight into city dynamics in relation to venue distribution.
To perform this analysis I used Python as the programming language, gathered geographical data in the form of a GeoJSON file found in the Spatial Data Repository of NYU, and region data from the Portuguese Institute of Statistics.
First I had to edit the JSON file to clean the parish-border data; then I gathered information for each parish, such as unemployment, average age, and population density. I then performed an analysis of the cities of Abrantes, Tomar, and Porto using the Foursquare API, clustering with the KMeans algorithm, and comparing the cluster distribution with the parish data.
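The GeoJSON cleaning step can be sketched as follows. This is a minimal illustration with made-up features, not the actual NYU file; the property key `concelho` (municipality) and the feature names are assumptions about the file's schema.

```python
# Sketch of the GeoJSON cleaning step: keep only the parish features of the
# city being analysed. The "concelho" property key is an assumed schema.
geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "properties": {"name": "Alvega", "concelho": "Abrantes"},
         "geometry": None},
        {"type": "Feature",
         "properties": {"name": "Madalena", "concelho": "Tomar"},
         "geometry": None},
    ],
}

# Filter the feature list down to one municipality
abrantes_geo = {
    "type": "FeatureCollection",
    "features": [f for f in geojson["features"]
                 if f["properties"]["concelho"] == "Abrantes"],
}

print(len(abrantes_geo["features"]))  # -> 1
```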
The data I used was:
This section will show and explore the code I used to conduct the analysis.
Import the needed libraries.
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import json # library to handle JSON files
#!conda install -c conda-forge geopy --yes # uncomment this line if you don't have it installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import requests # library to handle requests
from pandas.io.json import json_normalize # transform a JSON file into a pandas dataframe (deprecated in newer pandas; use pd.json_normalize)
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt
# import k-means from clustering stage
from sklearn.cluster import KMeans
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you don't have it installed
import folium # map rendering library
Foursquare credentials
CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID (keep credentials out of published code)
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100
This function retrieves nearby venues from Foursquare
def getNearbyVenues(names, lat, lng, distance):
    venues_list = []
    for name, la, ln, dist in zip(names, lat, lng, distance):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            la,
            ln,
            dist,
            LIMIT)
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name,
            la,
            ln,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['parish',
                             'freg lat',
                             'freg lng',
                             'Venue',
                             'Venue lat',
                             'Venue lng',
                             'Venue cat']
    return nearby_venues
Now that we have the general part of the code, we can get to the specifics.
We will start with Abrantes.
Importing the JSON file.
abrantes = r'C:\Users\Pc\Desktop\Project\abrantes5.json'
with open(abrantes) as ab:
    ab_data = ab.readlines()
ab_data = [json.loads(line) for line in ab_data]
ab_data = ab_data[0]
with open(abrantes) as ab:
    abr = json.load(ab)
# Append a tooltip column with customised text
tooltip_text = []
for idx in range(19):  # one entry per parish feature in the file
    tooltip_text.append(ab_data['features'][idx]['properties'])
Importing the Excel file with the region (parish) data.
coord_abrantes=r'C:\Users\Pc\Desktop\Project\abrantes.xlsx'
df_abrantes = pd.read_excel(coord_abrantes)
df_abrantes=df_abrantes.drop(columns=['id'])
print('the data size is: ' + str(df_abrantes.shape))
df_abrantes.head()
Here you can see part of the data for the 19 parishes of Abrantes. We will use average age, unemployment rate, and population density.
The coordinates indicate the location around which venue information will be gathered for each parish.
Note that these points do not represent the geographical centers of the parishes, but rather the points around which it makes most sense to gather venue information (i.e., it makes no sense to look for venues in unpopulated areas such as forests).
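As a sanity check on the chosen search radii, the distance between two coordinates can be computed with the haversine formula. This helper is not part of the original notebook, and the coordinates below are illustrative (a point near the Abrantes center and a point roughly 1 km north of it).

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two (lat, lng) points."""
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    dlat, dlng = lat2 - lat1, lng2 - lng1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlng / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))  # mean Earth radius of 6371 km

# Illustrative points: one near the Abrantes centre, one ~1 km further north
d = haversine_m(39.4648, -8.1996, 39.4738, -8.1996)
print(round(d))  # roughly 1000 metres
```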
Now we can start to gather venue information.
abrantes_venues = getNearbyVenues(names=df_abrantes['parish'],
lat=df_abrantes['lat'],
lng=df_abrantes['lng'],
distance=df_abrantes['distance'])
This is the result of the search (note that not all of the data is displayed in this table, for the sake of space).
print(abrantes_venues.shape)
abrantes_venues.head()
A total of 76 venues were sampled.
abrantes_venues.groupby('parish').count()
The above table shows the number of venues gathered for each parish.
Now we perform one-hot encoding to allow the KMeans algorithm to do its job.
# one hot encoding
abrantes_onehot = pd.get_dummies(abrantes_venues[['Venue cat']], prefix="", prefix_sep="")
# add neighborhood column back to dataframe
abrantes_onehot['parish'] = abrantes_venues['parish']
# move neighborhood column to the first column
fixed_columns = [abrantes_onehot.columns[-1]] + list(abrantes_onehot.columns[:-1])
abrantes_onehot = abrantes_onehot[fixed_columns]
print(abrantes_onehot.shape)
abrantes_onehot.head()
And transform the table into one that gives the information in a more relevant format.
abrantes_grouped = abrantes_onehot.groupby('parish').mean().reset_index()
print(abrantes_grouped.shape)
abrantes_grouped
Now we will sample the most common venues in each parish.
First we must define the function that extracts the most common venues.
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]
num_top_venues = 10
indicators = ['st', 'nd', 'rd']
# create columns according to number of top venues
columns = ['parish']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))
# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['parish'] = abrantes_grouped['parish']
for ind in np.arange(abrantes_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(abrantes_grouped.iloc[ind, :], num_top_venues)
To keep this article from getting too long, we will display the results of the previous cell after the KMeans algorithm does its job.
# set number of clusters
kclusters = 6  # see the elbow method below
abrantes_grouped_clustering = abrantes_grouped.drop('parish', axis=1)
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(abrantes_grouped_clustering)
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
#neighborhoods_venues_sorted.head()
abrantes_merged = df_abrantes.copy()
abrantes_merged.columns = ['parish','average age','unemployment','lat','lng','distance','pop density']
# merge the sorted-venues table with the parish data to add latitude/longitude for each parish
abrantes_merged = abrantes_merged.join(neighborhoods_venues_sorted.set_index('parish'), on='parish')
abrantes_merged
from scipy.spatial.distance import cdist
distortions = []
K = range(1,10)
for k in K:
    kmeanModel = KMeans(n_clusters=k, random_state=0).fit(abrantes_grouped_clustering)
    distortions.append(sum(np.min(cdist(abrantes_grouped_clustering, kmeanModel.cluster_centers_, 'canberra'), axis=1)) / abrantes_grouped_clustering.shape[0])
# Plot the elbow
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()
Now we have our full dataset for the city of Abrantes.
I had to push the number of clusters up to 6 because otherwise I would end up with too little cluster diversity.
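As a complementary check to the elbow plot when choosing k, one could also compute the silhouette score for each candidate. This was not part of the original analysis, and the matrix below is a random stand-in for the real venue-frequency table (19 parishes by 30 assumed categories), so the scores themselves are not meaningful.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# synthetic stand-in for the venue-frequency matrix (19 parishes x 30 categories)
X = rng.random((19, 30))

# silhouette score for each candidate k: higher means better-separated clusters
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```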
threshold_scale = np.linspace(abrantes_merged['unemployment'].min(),
abrantes_merged['unemployment'].max(),
10, dtype=int)
threshold_scale = threshold_scale.tolist() # change the numpy array to a list
threshold_scale[-1] = threshold_scale[-1] + 1
map_clusters_abrantes_unemployment = folium.Map(location=[39.464805,-8.199648], zoom_start=11)
choropleth = folium.Choropleth(
geo_data = abr,
name = 'choropleth',
data = abrantes_merged,
columns = ['parish','unemployment'],
key_on = 'feature.properties.name',
threshold_scale=threshold_scale,
fill_color = 'YlOrRd',
fill_opacity = 0.7,
line_opacity = 0.4,
legend_name='unemployment %',
#reset=True
).add_to(map_clusters_abrantes_unemployment)
folium.LayerControl().add_to(map_clusters_abrantes_unemployment)
choropleth.geojson.add_child(
folium.features.GeoJsonTooltip(['name'], labels=False)
).add_to(map_clusters_abrantes_unemployment)
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(abrantes_merged['lat'],
                                  abrantes_merged['lng'],
                                  abrantes_merged['parish'],
                                  abrantes_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters_abrantes_unemployment)
threshold_scale = np.linspace(abrantes_merged['average age'].min(),
abrantes_merged['average age'].max(),
10, dtype=int)
threshold_scale = threshold_scale.tolist() # change the numpy array to a list
threshold_scale[-1] = threshold_scale[-1] + 1
map_clusters_abrantes_age = folium.Map(location=[39.464805,-8.199648], zoom_start=11)
choropleth = folium.Choropleth(
geo_data = abr,
name = 'choropleth',
data = abrantes_merged,
columns = ['parish','average age'],
key_on = 'feature.properties.name',
threshold_scale=threshold_scale,
fill_color = 'YlOrRd',
fill_opacity = 0.7,
line_opacity = 0.4,
legend_name='average age',
#reset=True
).add_to(map_clusters_abrantes_age)
folium.LayerControl().add_to(map_clusters_abrantes_age)
choropleth.geojson.add_child(
folium.features.GeoJsonTooltip(['name'], labels=False)
).add_to(map_clusters_abrantes_age)
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(abrantes_merged['lat'],
                                  abrantes_merged['lng'],
                                  abrantes_merged['parish'],
                                  abrantes_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters_abrantes_age)
threshold_scale = np.linspace(abrantes_merged['pop density'].min(),
abrantes_merged['pop density'].max(),
10, dtype=int)
threshold_scale = threshold_scale.tolist() # change the numpy array to a list
threshold_scale[-1] = threshold_scale[-1] + 1
map_clusters_abrantes_denspop = folium.Map(location=[39.464805,-8.199648], zoom_start=11)
choropleth = folium.Choropleth(
geo_data = abr,
name = 'choropleth',
data = abrantes_merged,
columns = ['parish','pop density'],
key_on = 'feature.properties.name',
threshold_scale=threshold_scale,
fill_color = 'YlOrRd',
fill_opacity = 0.7,
line_opacity = 0.4,
legend_name='population density (Residents per km^2)',
#reset=True
).add_to(map_clusters_abrantes_denspop)
folium.LayerControl().add_to(map_clusters_abrantes_denspop)
choropleth.geojson.add_child(
folium.features.GeoJsonTooltip(['name'], labels=False)
).add_to(map_clusters_abrantes_denspop)
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(abrantes_merged['lat'],
                                  abrantes_merged['lng'],
                                  abrantes_merged['parish'],
                                  abrantes_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters_abrantes_denspop)
The process for Tomar is analogous to that for Abrantes, so I will not explain every part of the code here.
tomar = r'C:\Users\Pc\Desktop\Project\tomar.json'
with open(tomar) as tm:
    tmr = json.load(tm)
with open(tomar) as tm:
    tm_data = tm.readlines()
tm_data = [json.loads(line) for line in tm_data]
tm_data = tm_data[0]
tooltip_text = []
for idx in range(16):  # one entry per parish feature in the file
    tooltip_text.append(tm_data['features'][idx]['properties'])
coord_tomar=r'C:\Users\Pc\Desktop\Project\tomar.xlsx'
df_tomar = pd.read_excel(coord_tomar)
tomar_venues = getNearbyVenues(names=df_tomar['parish'],
lat=df_tomar['lat'],
lng=df_tomar['lng'],
distance=df_tomar['distance'])
print(tomar_venues.shape)
tomar_venues.head()
tomar_venues.groupby('parish').count()
# one hot encoding
tomar_onehot = pd.get_dummies(tomar_venues[['Venue cat']], prefix="", prefix_sep="")
# add neighborhood column back to dataframe
tomar_onehot['parish'] = tomar_venues['parish']
# move neighborhood column to the first column
fixed_columns = [tomar_onehot.columns[-1]] + list(tomar_onehot.columns[:-1])
tomar_onehot = tomar_onehot[fixed_columns]
print(tomar_onehot.shape)
tomar_onehot.head()
tomar_grouped = tomar_onehot.groupby('parish').mean().reset_index()
num_top_venues = 10
indicators = ['st', 'nd', 'rd']
# create columns according to number of top venues
columns = ['parish']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))
# create a new dataframe
neighborhoods_venues_sorted_tomar = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted_tomar['parish'] = tomar_grouped['parish']
for ind in np.arange(tomar_grouped.shape[0]):
    neighborhoods_venues_sorted_tomar.iloc[ind, 1:] = return_most_common_venues(tomar_grouped.iloc[ind, :], num_top_venues)
# set number of clusters
kclusters = 6  # see the elbow method below
tomar_grouped_clustering = tomar_grouped.drop('parish', axis=1)
# run k-means clustering
kmeanst = KMeans(n_clusters=kclusters, random_state=0).fit(tomar_grouped_clustering)
# add clustering labels
neighborhoods_venues_sorted_tomar.insert(0, 'Cluster Labels', kmeanst.labels_)
#neighborhoods_venues_sorted.head()
tomar_merged = df_tomar.copy()
tomar_merged.columns = ['parish','average age','unemployment','lat','lng','distance','pop density']
# merge the sorted-venues table with the parish data to add latitude/longitude for each parish
tomar_merged = tomar_merged.join(neighborhoods_venues_sorted_tomar.set_index('parish'), on='parish')
tomar_merged = tomar_merged.drop([0, 2, 13])  # parishes without enough venue data
tomar_merged  # check the last columns!
from scipy.spatial.distance import cdist
distortions = []
K = range(1,10)
for k in K:
    kmeanModel = KMeans(n_clusters=k, random_state=0).fit(tomar_grouped_clustering)
    distortions.append(sum(np.min(cdist(tomar_grouped_clustering, kmeanModel.cluster_centers_, 'canberra'), axis=1)) / tomar_grouped_clustering.shape[0])
# Plot the elbow
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()
Here I also had to choose a higher number of clusters, for the same reason as in Abrantes: to get some cluster diversity.
Here we have the Tomar dataset. Note that some regions had to be dropped because of the lack of venue data.
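In the code above, the sparse parishes were dropped by row index. An equivalent, more explicit approach would be to filter parishes by their venue count; the sketch below uses made-up parish names and an assumed threshold, not the project's data.

```python
import pandas as pd

# Hypothetical venue counts per parish (names are illustrative)
counts = pd.DataFrame({"parish": ["Madalena", "Serra", "Casais"],
                       "venues": [12, 1, 0]})

MIN_VENUES = 2  # assumed minimum number of venues to keep a parish
kept = counts.loc[counts["venues"] >= MIN_VENUES, "parish"].tolist()
print(kept)  # -> ['Madalena']
```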
threshold_scale = np.linspace(tomar_merged['unemployment'].min(),
tomar_merged['unemployment'].max(),
10, dtype=int)
threshold_scale = threshold_scale.tolist() # change the numpy array to a list
threshold_scale[-1] = threshold_scale[-1] + 1
map_clusters_tomar_unemployment = folium.Map(location=[39.602530,-8.409337], zoom_start=11)
choropleth = folium.Choropleth(
geo_data = tmr,
name = 'choropleth',
data = tomar_merged,
columns = ['parish','unemployment'],
key_on = 'feature.properties.name',
threshold_scale=threshold_scale,
fill_color = 'YlOrRd',
fill_opacity = 0.7,
line_opacity = 0.4,
legend_name='unemployment %',
#reset=True
).add_to(map_clusters_tomar_unemployment)
folium.LayerControl().add_to(map_clusters_tomar_unemployment)
choropleth.geojson.add_child(
folium.features.GeoJsonTooltip(['name'], labels=False)
).add_to(map_clusters_tomar_unemployment)
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(tomar_merged['lat'],
                                  tomar_merged['lng'],
                                  tomar_merged['parish'],
                                  tomar_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(int(cluster)), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters_tomar_unemployment)
threshold_scale = np.linspace(tomar_merged['average age'].min(),
tomar_merged['average age'].max(),
10, dtype=int)
threshold_scale = threshold_scale.tolist() # change the numpy array to a list
threshold_scale[-1] = threshold_scale[-1] + 1
map_clusters_tomar_age = folium.Map(location=[39.602530,-8.409337], zoom_start=11)
choropleth = folium.Choropleth(
geo_data = tmr,
name = 'choropleth',
data = tomar_merged,
columns = ['parish','average age'],
key_on = 'feature.properties.name',
threshold_scale=threshold_scale,
fill_color = 'YlOrRd',
fill_opacity = 0.7,
line_opacity = 0.4,
legend_name='average age',
#reset=True
).add_to(map_clusters_tomar_age)
folium.LayerControl().add_to(map_clusters_tomar_age)
choropleth.geojson.add_child(
folium.features.GeoJsonTooltip(['name'], labels=False)
).add_to(map_clusters_tomar_age)
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(tomar_merged['lat'],
                                  tomar_merged['lng'],
                                  tomar_merged['parish'],
                                  tomar_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(int(cluster)), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters_tomar_age)
threshold_scale = np.linspace(tomar_merged['pop density'].min(),
tomar_merged['pop density'].max(),
10, dtype=int)
threshold_scale = threshold_scale.tolist() # change the numpy array to a list
threshold_scale[-1] = threshold_scale[-1] + 1
map_clusters_tomar_popdens = folium.Map(location=[39.602530,-8.409337], zoom_start=11)
choropleth = folium.Choropleth(
geo_data = tmr,
name = 'choropleth',
data = tomar_merged,
columns = ['parish','pop density'],
key_on = 'feature.properties.name',
threshold_scale=threshold_scale,
fill_color = 'YlOrRd',
fill_opacity = 0.7,
line_opacity = 0.4,
legend_name='population density (Residents per km^2)',
#reset=True
).add_to(map_clusters_tomar_popdens)
folium.LayerControl().add_to(map_clusters_tomar_popdens)
choropleth.geojson.add_child(
folium.features.GeoJsonTooltip(['name'], labels=False)
).add_to(map_clusters_tomar_popdens)
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(tomar_merged['lat'],
                                  tomar_merged['lng'],
                                  tomar_merged['parish'],
                                  tomar_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(int(cluster)), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters_tomar_popdens)
porto = r'C:\Users\Pc\Desktop\Project\porto.json'
with open(porto) as pt:
    prt = json.load(pt)
with open(porto) as pt:
    pt_data = pt.readlines()
pt_data = [json.loads(line) for line in pt_data]
pt_data = pt_data[0]
tooltip_text = []
for idx in range(78):  # one entry per parish feature in the file
    tooltip_text.append(pt_data['features'][idx]['properties'])
coord_porto=r'C:\Users\Pc\Desktop\Project\porto.xlsx'
df_porto = pd.read_excel(coord_porto)
df_porto.head()
porto_venues = getNearbyVenues(names=df_porto['parish'],
lat=df_porto['lat'],
lng=df_porto['lng'],
distance=df_porto['distance'])
# one hot encoding
porto_onehot = pd.get_dummies(porto_venues[['Venue cat']], prefix="", prefix_sep="")
# add neighborhood column back to dataframe
porto_onehot['parish'] = porto_venues['parish']
# move neighborhood column to the first column
fixed_columns = [porto_onehot.columns[-1]] + list(porto_onehot.columns[:-1])
porto_onehot = porto_onehot[fixed_columns]
print(porto_onehot.shape)
porto_grouped = porto_onehot.groupby('parish').mean().reset_index()
print(porto_grouped.shape)
porto_grouped.head()
num_top_venues = 15
indicators = ['st', 'nd', 'rd']
# create columns according to number of top venues
columns = ['parish']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))
# create a new dataframe
neighborhoods_venues_sorted_porto = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted_porto['parish'] = porto_grouped['parish']
for ind in np.arange(porto_grouped.shape[0]):
    neighborhoods_venues_sorted_porto.iloc[ind, 1:] = return_most_common_venues(porto_grouped.iloc[ind, :], num_top_venues)
# set number of clusters
kclusters = 6  # see the elbow method below
porto_grouped_clustering = porto_grouped.drop('parish', axis=1)
# run k-means clustering
kmeansp = KMeans(n_clusters=kclusters, random_state=0).fit(porto_grouped_clustering)
# add clustering labels
neighborhoods_venues_sorted_porto.insert(0, 'Cluster Labels', kmeansp.labels_)
porto_merged = df_porto.copy()
porto_merged.columns = ['parish','average age','unemployment','lat','lng','distance','pop density']
# merge the sorted-venues table with the parish data to add latitude/longitude for each parish
porto_merged = porto_merged.join(neighborhoods_venues_sorted_porto.set_index('parish'), on='parish')
porto_merged = porto_merged.drop(39)  # parish without enough venue data
porto_merged.head()
from scipy.spatial.distance import cdist
distortions = []
K = range(1,10)
for k in K:
    kmeanModel = KMeans(n_clusters=k, random_state=0).fit(porto_grouped_clustering)
    distortions.append(sum(np.min(cdist(porto_grouped_clustering, kmeanModel.cluster_centers_, 'canberra'), axis=1)) / porto_grouped_clustering.shape[0])
# Plot the elbow
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()
threshold_scale = np.linspace(porto_merged['unemployment'].min(),
porto_merged['unemployment'].max(),
10, dtype=int)
threshold_scale = threshold_scale.tolist() # change the numpy array to a list
threshold_scale[-1] = threshold_scale[-1] + 1
map_clusters_porto_unemployment = folium.Map(location=[41.141127,-8.607638], zoom_start=11)
choropleth = folium.Choropleth(
geo_data = prt,
name = 'choropleth',
data = porto_merged,
columns = ['parish','unemployment'],
key_on = 'feature.properties.name',
threshold_scale=threshold_scale,
fill_color = 'YlOrRd',
fill_opacity = 0.7,
line_opacity = 0.4,
legend_name='unemployment (%)',
#reset=True
).add_to(map_clusters_porto_unemployment)
folium.LayerControl().add_to(map_clusters_porto_unemployment)
choropleth.geojson.add_child(
folium.features.GeoJsonTooltip(['name'], labels=False)
).add_to(map_clusters_porto_unemployment)
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# add markers to the map
markers_colors = []
for lat, lng, poi, cluster in zip(porto_merged['lat'],
                                  porto_merged['lng'],
                                  porto_merged['parish'],
                                  porto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(int(cluster)), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters_porto_unemployment)
threshold_scale = np.linspace(porto_merged['average age'].min(),
porto_merged['average age'].max(),
10, dtype=int)
threshold_scale = threshold_scale.tolist() # change the numpy array to a list
threshold_scale[-1] = threshold_scale[-1] + 1
map_clusters_porto_age = folium.Map(location=[41.141127,-8.607638], zoom_start=11)
choropleth = folium.Choropleth(
geo_data = prt,
name = 'choropleth',
data = porto_merged,
columns = ['parish','average age'],
key_on = 'feature.properties.name',
threshold_scale=threshold_scale,
fill_color = 'YlOrRd',
fill_opacity = 0.7,
line_opacity = 0.4,
legend_name='average age',
#reset=True
).add_to(map_clusters_porto_age)
folium.LayerControl().add_to(map_clusters_porto_age)
choropleth.geojson.add_child(
folium.features.GeoJsonTooltip(['name'], labels=False)
).add_to(map_clusters_porto_age)
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# add markers to the map
markers_colors = []
for lat, lng, poi, cluster in zip(porto_merged['lat'],
                                  porto_merged['lng'],
                                  porto_merged['parish'],
                                  porto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(int(cluster)), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters_porto_age)
threshold_scale = np.linspace(porto_merged['pop density'].min(),
porto_merged['pop density'].max(),
10, dtype=int)
threshold_scale = threshold_scale.tolist() # change the numpy array to a list
threshold_scale[-1] = threshold_scale[-1] + 1
map_clusters_porto_popdens = folium.Map(location=[41.141127,-8.607638], zoom_start=11)
choropleth = folium.Choropleth(
geo_data = prt,
name = 'choropleth',
data = porto_merged,
columns = ['parish','pop density'],
key_on = 'feature.properties.name',
threshold_scale=threshold_scale,
fill_color = 'YlOrRd',
fill_opacity = 0.7,
line_opacity = 0.4,
legend_name='population density (Residents per km^2)',
#reset=True
).add_to(map_clusters_porto_popdens)
folium.LayerControl().add_to(map_clusters_porto_popdens)
choropleth.geojson.add_child(
folium.features.GeoJsonTooltip(['name'], labels=False)
).add_to(map_clusters_porto_popdens)
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# add markers to the map
markers_colors = []
for lat, lng, poi, cluster in zip(porto_merged['lat'],
                                  porto_merged['lng'],
                                  porto_merged['parish'],
                                  porto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(int(cluster)), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters_porto_popdens)
Abrantes unemployment map:
count_venue_abrantes = abrantes_merged
count_venue_abrantes = count_venue_abrantes.drop(['parish','average age','unemployment','lat','lng','distance','pop density'], axis=1)
count_venue_abrantes = count_venue_abrantes.groupby(['Cluster Labels','1st Most Common Venue']).size().reset_index(name='Counts')
# pivot it to plot a bar chart
cv_cluster_abrantes = count_venue_abrantes.pivot(index='Cluster Labels', columns='1st Most Common Venue', values='Counts')
cv_cluster_abrantes = cv_cluster_abrantes.fillna(0).astype(int).reset_index(drop=True)
# creating a bar chart of "Number of Venues in Each Cluster"
ax_abrantes = cv_cluster_abrantes.plot(kind='bar', figsize=(20,8), width=0.8)
plt.legend(labels=cv_cluster_abrantes.columns, fontsize=14)
plt.title("Number of Venues in Each Cluster", fontsize=16)
plt.xticks(fontsize=14)
plt.xticks(rotation=0)
plt.xlabel('Clusters', fontsize=14)
plt.ylabel('Number of Venues', fontsize=14)
map_clusters_abrantes_unemployment
Abrantes average age map:
map_clusters_abrantes_age
Abrantes population density map:
map_clusters_abrantes_denspop
Tomar unemployment map:
count_venue_tomar = tomar_merged
count_venue_tomar = count_venue_tomar.drop(['parish','average age','unemployment','lat','lng','distance','pop density'], axis=1)
count_venue_tomar = count_venue_tomar.groupby(['Cluster Labels','1st Most Common Venue']).size().reset_index(name='Counts')
#we can transpose it to plot bar chart
cv_cluster_tomar = count_venue_tomar.pivot(index='Cluster Labels', columns='1st Most Common Venue', values='Counts')
cv_cluster_tomar = cv_cluster_tomar.fillna(0).astype(int).reset_index(drop=True)
#creating a bar chart of "Number of Venues in Each Cluster"
ax_tomar = cv_cluster_tomar.plot(kind='bar', figsize=(20,8), width=0.8)
plt.legend(labels=cv_cluster_tomar.columns,fontsize= 14)
plt.title("Number of Venues in Each Cluster",fontsize= 16)
plt.xticks(fontsize=14)
plt.xticks(rotation=0)
plt.xlabel('Clusters', fontsize=14)
plt.ylabel('Number of Venues', fontsize=14)
map_clusters_tomar_unemployment
Tomar average age map:
map_clusters_tomar_age
Tomar population density map:
map_clusters_tomar_popdens
Porto unemployment map:
count_venue_porto = porto_merged
count_venue_porto = count_venue_porto.drop(['parish','average age','unemployment','lat','lng','distance','pop density'], axis=1)
count_venue_porto = count_venue_porto.groupby(['Cluster Labels','1st Most Common Venue']).size().reset_index(name='Counts')
#we can transpose it to plot bar chart
cv_cluster_porto = count_venue_porto.pivot(index='Cluster Labels', columns='1st Most Common Venue', values='Counts')
cv_cluster_porto = cv_cluster_porto.fillna(0).astype(int).reset_index(drop=True)
#creating a bar chart of "Number of Venues in Each Cluster"
ax_porto = cv_cluster_porto.plot(kind='bar', figsize=(20,8), width=0.8)
plt.legend(labels=cv_cluster_porto.columns,fontsize= 14)
plt.title("Number of Venues in Each Cluster",fontsize= 16)
plt.xticks(fontsize=14)
plt.xticks(rotation=0)
plt.xlabel('Clusters', fontsize=14)
plt.ylabel('Number of Venues', fontsize=14)
map_clusters_porto_unemployment
Porto average age map:
map_clusters_porto_age
Porto population density map:
map_clusters_porto_popdens
In Abrantes, we have a clear central, more urban cluster, i.e. the red cluster 0. We can see that the unemployment rate and the average age bear little weight on how this cluster is located and how it propagates. This cluster starts in the central area of Abrantes and propagates north. Potential reasons for this propagation:
Outside of the central cluster we begin to see more cluster diversity.
Abrantes used to be a port city, so more investment was made towards the center. Over time, people started to leave central areas of Portugal, such as Abrantes, and moved towards the coast and the big cities. To fight this tendency, Abrantes invested in a tourist area near the northern river, which has interesting views and sites, as well as riverside beaches.
In Tomar we see the same phenomenon as in Abrantes, the existence of a central cluster and its propagation, but this time the propagation is towards the east.
To the south we see a transitory zone: areas where people live between various big cities, i.e. Torres Novas and Entroncamento.
To the north we have more rural areas, dominated by forests and farms.
With this in mind, the potential reasons for the propagation of the central cluster towards the east may be:
North of Tomar we begin to see the cluster diversity we saw in the south of Abrantes.
In Porto we can see the same phenomenon, the propagation of the central clusters. Here the propagation seems to be more towards the north and the coast.
It seems to be supported by higher population density closer to the central area and by a lower unemployment rate further from it.
Possible reasons for this propagation:
In Porto we can see some possible effects of the unemployment rate, i.e., support for the propagation of the central clusters.
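One way to make the comparison between clusters and parish parameters concrete is to average each parameter within each cluster and see which one separates the clusters most. The sketch below uses illustrative numbers, not the project's data.

```python
import pandas as pd

# Illustrative parish-level data: cluster labels plus two parameters.
# (Made-up values; the real analysis would use abrantes_merged etc.)
df = pd.DataFrame({
    "Cluster Labels": [0, 0, 1, 1, 2, 2],
    "unemployment":   [6.1, 5.8, 9.4, 10.2, 7.5, 7.1],
    "pop density":    [850, 920, 60, 45, 310, 290],
})

# Mean of each parameter per cluster: large spreads across clusters suggest
# the parameter tracks the cluster structure.
means = df.groupby("Cluster Labels").mean()
print(means)
```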
Through this analysis we can see some differences in the dynamics of the cities.
First, in all three cities, we see a central cluster and peripheral clusters: